A simple (i.e. no error checking or other sensible engineering) notebook to extract the student answer data from a single XML file.

I'll also export the data to a csv file at the end of this, so that it's easy to read in at the beginning of another notebook.

Following discussions with Suraj, we want the representation to take into account the student's response, the official answer, and the grade. So there'll be a little fiddliness linking the student response back to the gold standard response.

So, first read the file:


In [3]:
filename='semeval2013-task7/semeval2013-Task7-5way/beetle/train/Core/FaultFinding-BULB_C_VOLTAGE_EXPLAIN_WHY1.xml'

It's an XML file, so we'll need the xml.etree parser, and pandas so that we can import the data into a dataframe:


In [4]:
import pandas as pd

from xml.etree import ElementTree as ET

In [7]:
tree=ET.parse(filename)

r=tree.getroot()

Now, the reference answers are in the second daughter node of the tree. We can extract these and store them in a dictionary. To distinguish between reference answer tokens and student response tokens, I'm going to append _RA to each token in the reference answers, and _SR to each token in a student response.
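The positional indexing used below (root[1] for reference answers, root[2] for student answers) assumes a fixed child order. A tiny self-contained sketch shows both that and the more robust tag-based lookup; the tag names here are illustrative, not necessarily those of the real SemEval files:

```python
from xml.etree import ElementTree as ET

# Toy document with the same assumed shape: question text first,
# then reference answers, then student answers.
toy_xml = """
<question id="q1">
  <questionText>Why?</questionText>
  <referenceAnswers>
    <referenceAnswer id="a1">the gap separates them</referenceAnswer>
  </referenceAnswers>
  <studentAnswers>
    <studentAnswer accuracy="correct">there is a gap</studentAnswer>
  </studentAnswers>
</question>
"""

root = ET.fromstring(toy_xml)
print(root[1][0].text)                                # positional: 'the gap separates them'
print(root.find('referenceAnswers')[0].attrib['id'])  # tag-based: 'a1'
```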


In [30]:
from string import punctuation

def to_tokens(textIn):
    '''Convert the input textIn to a list of tokens'''
    tokens_ls=[t.lower().strip(punctuation) for t in textIn.split()]
    # remove any empty tokens
    return [t for t in tokens_ls if t]

test_str='"Help!" yelped the banana, who was obviously scared out of his skin.'
print(test_str)
print(to_tokens(test_str))


"Help!" yelped the banana, who was obviously scared out of his skin.
['help', 'yelped', 'the', 'banana', 'who', 'was', 'obviously', 'scared', 'out', 'of', 'his', 'skin']
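One quirk worth noting: str.strip(punctuation) only removes punctuation from the ends of a token, so internal characters survive. That's why tokens like aren"t_SR and 1.5_SR turn up in the document frequencies later. A quick check using the same function:

```python
from string import punctuation

def to_tokens(textIn):
    '''Convert the input textIn to a list of tokens'''
    tokens_ls=[t.lower().strip(punctuation) for t in textIn.split()]
    # remove any empty tokens
    return [t for t in tokens_ls if t]

# Edge punctuation goes, internal punctuation stays:
print(to_tokens('it aren"t connected... at 1.5 V!'))
# ['it', 'aren"t', 'connected', 'at', '1.5', 'v']
```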

In [50]:
refAnswers_dict={refAnswer.attrib['id']:[t+'_RA' for t in to_tokens(refAnswer.text)] 
                 for refAnswer in r[1]}    
refAnswers_dict


Out[50]:
{'answer204': ['terminal_RA',
  '1_RA',
  'and_RA',
  'the_RA',
  'positive_RA',
  'terminal_RA',
  'are_RA',
  'separated_RA',
  'by_RA',
  'the_RA',
  'gap_RA'],
 'answer205': ['terminal_RA',
  '1_RA',
  'and_RA',
  'the_RA',
  'positive_RA',
  'terminal_RA',
  'are_RA',
  'not_RA',
  'connected_RA'],
 'answer206': ['terminal_RA',
  '1_RA',
  'is_RA',
  'connected_RA',
  'to_RA',
  'the_RA',
  'negative_RA',
  'battery_RA',
  'terminal_RA'],
 'answer207': ['terminal_RA',
  '1_RA',
  'is_RA',
  'not_RA',
  'separated_RA',
  'from_RA',
  'the_RA',
  'negative_RA',
  'battery_RA',
  'terminal_RA'],
 'answer207.NEW': ['terminal_RA',
  '1_RA',
  'and_RA',
  'the_RA',
  'positive_RA',
  'battery_RA',
  'terminal_RA',
  'are_RA',
  'in_RA',
  'different_RA',
  'electrical_RA',
  'states_RA']}

Next, we need to extract each of the student responses. These are in the third daughter node:


In [41]:
print(r[2][0].text)
r[2][0].attrib


positive battery terminal is separated by a gap from terminal 1
Out[41]:
{'accuracy': 'correct',
 'answerMatch': 'answer204',
 'count': '1',
 'id': 'FaultFinding-BULB_C_VOLTAGE_EXPLAIN_WHY1.sbj3-l1.qa193'}

In [58]:
responses_ls=[]
for (i, studentResponse) in enumerate(r[2]):
    if 'answerMatch' in studentResponse.attrib:
        matchTokens_ls=refAnswers_dict[studentResponse.attrib['answerMatch']]
    else:
        matchTokens_ls=[]
    responses_ls.append({'accuracy':studentResponse.attrib['accuracy'],
                         'text':studentResponse.text,
                         'tokens':[t+'_SR' for t in to_tokens(studentResponse.text)] + matchTokens_ls})

responses_ls[36]


Out[58]:
{'accuracy': 'correct',
 'text': 'the positive battery terminal and terminal 1 are not connected',
 'tokens': ['the_SR',
  'positive_SR',
  'battery_SR',
  'terminal_SR',
  'and_SR',
  'terminal_SR',
  '1_SR',
  'are_SR',
  'not_SR',
  'connected_SR',
  'terminal_RA',
  '1_RA',
  'and_RA',
  'the_RA',
  'positive_RA',
  'terminal_RA',
  'are_RA',
  'not_RA',
  'connected_RA']}

OK, that seems to work. Now let's define a function that takes a filename and returns the list of token dictionaries:


In [66]:
def extract_token_dictionaries(filenameIn):
    
    # Local copy of to_tokens, so that this function is self-contained
    def to_tokens_local(textIn):
        '''Convert the input textIn to a list of tokens'''
        tokens_ls=[t.lower().strip(punctuation) for t in textIn.split()]
        # remove any empty tokens
        return [t for t in tokens_ls if t]

    tree=ET.parse(filenameIn)
    root=tree.getroot()
    
    refAnswers_dict={refAnswer.attrib['id']:[t+'_RA' for t in to_tokens_local(refAnswer.text)]
                     for refAnswer in root[1]}

    responsesOut_ls=[]
    for (i, studentResponse) in enumerate(root[2]):
        if 'answerMatch' in studentResponse.attrib:
            matchTokens_ls=refAnswers_dict[studentResponse.attrib['answerMatch']]
        else:
            matchTokens_ls=[]
        responsesOut_ls.append({'accuracy':studentResponse.attrib['accuracy'],
                                'text':studentResponse.text,
                                'tokens':[t+'_SR' for t in to_tokens_local(studentResponse.text)] \
                                          + matchTokens_ls})
    return responsesOut_ls

We now have a function which takes a filename and returns a list of tokenised student responses and reference answers:


In [68]:
extract_token_dictionaries(filename)[:2]


Out[68]:
[{'accuracy': 'correct',
  'text': 'positive battery terminal is separated by a gap from terminal 1',
  'tokens': ['positive_SR',
   'battery_SR',
   'terminal_SR',
   'is_SR',
   'separated_SR',
   'by_SR',
   'a_SR',
   'gap_SR',
   'from_SR',
   'terminal_SR',
   '1_SR',
   'terminal_RA',
   '1_RA',
   'and_RA',
   'the_RA',
   'positive_RA',
   'terminal_RA',
   'are_RA',
   'separated_RA',
   'by_RA',
   'the_RA',
   'gap_RA']},
 {'accuracy': 'correct',
  'text': 'terminal 1 is not connected to the positive terminal',
  'tokens': ['terminal_SR',
   '1_SR',
   'is_SR',
   'not_SR',
   'connected_SR',
   'to_SR',
   'the_SR',
   'positive_SR',
   'terminal_SR',
   'terminal_RA',
   '1_RA',
   'and_RA',
   'the_RA',
   'positive_RA',
   'terminal_RA',
   'are_RA',
   'not_RA',
   'connected_RA']}]

So next we need to be able to build a document frequency dictionary from a list of tokenised documents.


In [73]:
def document_frequencies(listOfTokenLists):
    # Build the set of all tokens used:
    token_set=set()
    for tokenList in listOfTokenLists:
        token_set=token_set.union(set(tokenList))
        
    # Then return the document frequency counts for each token
    
    return {t:len([l for l in listOfTokenLists if t in l])
            for t in token_set}
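Equivalently (just a sketch, not part of the notebook's pipeline), the two passes can be collapsed into a single Counter over per-document token sets, since each document should contribute at most 1 per token:

```python
from collections import Counter

def document_frequencies_alt(listOfTokenLists):
    '''Same document-frequency counts in one pass: deduplicate each
    document with set() so a token counts once per document.'''
    return dict(Counter(t for tokenList in listOfTokenLists
                          for t in set(tokenList)))

# 'a' and 'c' each appear in one document, 'b' in two:
print(document_frequencies_alt([['a', 'b', 'a'], ['b', 'c']]))
```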

In [81]:
tokenLists_ls=[x['tokens'] for x in extract_token_dictionaries(filename)]
document_frequencies(tokenLists_ls)


Out[81]:
{'1.5_SR': 3,
 '1_RA': 55,
 '1_SR': 40,
 '2_SR': 1,
 'a_SR': 31,
 'and_RA': 48,
 'and_SR': 20,
 'answer_SR': 1,
 'any_SR': 1,
 'are_RA': 48,
 'are_SR': 12,
 'aren"t_SR': 1,
 'at_SR': 3,
 'batteries_SR': 1,
 'battery"s_SR': 1,
 'battery_RA': 7,
 'battery_SR': 39,
 'becaquse_SR': 1,
 'because_SR': 28,
 'becuase_SR': 1,
 'between_SR': 9,
 'both_SR': 2,
 'bulb_SR': 7,
 'by_RA': 26,
 'by_SR': 10,
 'c_SR': 1,
 'charge_SR': 2,
 'circuit_SR': 3,
 'closed_SR': 1,
 'closing_SR': 1,
 'components_SR': 1,
 'connected_RA': 29,
 'connected_SR': 50,
 'connection_SR': 5,
 'contact_SR': 1,
 'created_SR': 1,
 'damaged_SR': 3,
 'difference_SR': 1,
 'different_SR': 3,
 'dint_SR': 1,
 'direct_SR': 1,
 'do_SR': 1,
 'dont_SR': 1,
 'each_SR': 2,
 'electrical_SR': 3,
 'end_SR': 1,
 'from_SR': 6,
 'gap_RA': 26,
 'gap_SR': 27,
 'gaps_SR': 1,
 'get_SR': 1,
 'had_SR': 2,
 'has_SR': 1,
 'have_SR': 1,
 'he_SR': 2,
 'i_SR': 4,
 'in_SR': 4,
 'is_RA': 7,
 'is_SR': 54,
 'it_SR': 6,
 'its_SR': 2,
 'know_SR': 2,
 'making_SR': 1,
 'me_SR': 1,
 'negative_RA': 7,
 'negative_SR': 13,
 'no_SR': 9,
 'not_RA': 22,
 'not_SR': 26,
 'of_SR': 2,
 'on_SR': 2,
 'one_SR': 8,
 'other_SR': 2,
 'path_SR': 1,
 'positive_RA': 48,
 'positive_SR': 52,
 'posittive_SR': 1,
 'positve_SR': 1,
 'postive_SR': 1,
 'psoitive_SR': 1,
 'reading_SR': 1,
 'same_SR': 1,
 'separated_RA': 26,
 'separated_SR': 6,
 'separates_SR': 1,
 'separation_SR': 2,
 'separted_SR': 1,
 'seperated_SR': 7,
 'so_SR': 1,
 'state_SR': 1,
 'states_SR': 3,
 'tell_SR': 1,
 'termianl_SR': 1,
 'terminal_RA': 55,
 'terminal_SR': 68,
 'terminals_SR': 6,
 'the_RA': 55,
 'the_SR': 71,
 'thebulb_SR': 1,
 'their_SR': 1,
 'then_SR': 2,
 'there_SR': 20,
 'they_SR': 3,
 'to_RA': 7,
 'to_SR': 42,
 'tot_SR': 1,
 'two_SR': 2,
 'understand_SR': 1,
 'v_SR': 1,
 'voltage_SR': 3,
 'was_SR': 18,
 'with_SR': 3}

Next, define a function which takes a list of tokens and a document frequency dictionary, and returns a dictionary of the tf.idf values for each of the tokens in the list. (Here "tf.idf" just means the raw term frequency divided by the raw document frequency, with no logarithm.) Note: for this function, if a token isn't in the document frequency dictionary, then it won't be returned in the tf.idf dictionary.

We can use the collections.Counter object to get the tf values.


In [82]:
from collections import Counter

In [86]:
def get_tfidf(tokens_ls, docFreq_dict):
    tf_dict=Counter(tokens_ls)
    return {t:tf_dict[t]/docFreq_dict[t] for t in tf_dict if t in docFreq_dict}

In [88]:
get_tfidf('the cat sat on the mat'.split(), {'cat':2, 'the':1})


Out[88]:
{'cat': 0.5, 'the': 2.0}
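For comparison, the textbook tf.idf weighting multiplies tf by log(N/df), so a token appearing in every document scores exactly zero rather than merely low. A sketch of that variant (get_tfidf_log and the n_docs parameter are my names, not from the notebook):

```python
import math
from collections import Counter

def get_tfidf_log(tokens_ls, docFreq_dict, n_docs):
    '''Textbook variant: tf * log(N / df).'''
    tf_dict = Counter(tokens_ls)
    return {t: tf_dict[t] * math.log(n_docs / docFreq_dict[t])
            for t in tf_dict if t in docFreq_dict}

# With N=4 documents: cat = 1*log(4/2) ~ 0.693, the = 2*log(4/1) ~ 2.773
print(get_tfidf_log('the cat sat on the mat'.split(), {'cat': 2, 'the': 1}, 4))
```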

Finally, we want to convert the outputs for all of the responses into a dataframe.


In [105]:
# Extract the data from the file:
tokenDictionaries_ls=extract_token_dictionaries(filename)

# Build the lists of response tokens (reusing the data we just extracted,
# rather than re-parsing the file):
tokenLists_ls=[x['tokens'] for x in tokenDictionaries_ls]

# Build the document frequency dict
docFreq_dict=document_frequencies(tokenLists_ls)

# Create the tf.idf for each response:
tfidf_ls=[get_tfidf(tokens_ls, docFreq_dict) for tokens_ls in tokenLists_ls]

# Now, create a dataframe which is indexed by the token dictionary:
trainingText_df=pd.DataFrame(index=docFreq_dict.keys())

# Use the index of responses in the list as column headers:
for (i, tfidf_dict) in enumerate(tfidf_ls):
    trainingText_df[i]=pd.Series(tfidf_dict, index=trainingText_df.index)

# Finally, transpose, and replace the NaNs with 0:
trainingText_df.fillna(0).T


Out[105]:
no_SR in_SR by_RA closed_SR know_SR two_SR end_SR direct_SR not_RA gap_RA ... is_RA components_SR its_SR to_RA had_SR other_SR they_SR charge_SR was_SR at_SR
0 0.000000 0.00 0.038462 0.0 0.0 0.0 0.0 0.0 0.000000 0.038462 ... 0.000000 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 0.000000 0.0
1 0.000000 0.00 0.000000 0.0 0.0 0.0 0.0 0.0 0.045455 0.000000 ... 0.000000 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 0.000000 0.0
2 0.000000 0.00 0.000000 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 ... 0.000000 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 0.000000 0.0
3 0.000000 0.00 0.000000 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 ... 0.000000 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 0.000000 0.0
4 0.000000 0.00 0.000000 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 ... 0.000000 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 0.000000 0.0
5 0.111111 0.00 0.000000 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 ... 0.000000 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 0.000000 0.0
6 0.000000 0.00 0.000000 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 ... 0.000000 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 0.000000 0.0
7 0.000000 0.00 0.038462 0.0 0.0 0.0 0.0 0.0 0.000000 0.038462 ... 0.000000 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 0.000000 0.0
8 0.000000 0.00 0.000000 0.0 0.0 0.0 0.0 0.0 0.045455 0.000000 ... 0.000000 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 0.000000 0.0
9 0.000000 0.00 0.000000 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 ... 0.000000 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 0.000000 0.0
10 0.000000 0.00 0.000000 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 ... 0.000000 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 0.000000 0.0
11 0.000000 0.00 0.000000 1.0 0.0 0.0 0.0 0.0 0.000000 0.000000 ... 0.000000 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 0.000000 0.0
12 0.000000 0.00 0.038462 0.0 0.0 0.0 0.0 0.0 0.000000 0.038462 ... 0.000000 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 0.000000 0.0
13 0.000000 0.00 0.000000 0.0 0.0 0.0 0.0 0.0 0.045455 0.000000 ... 0.000000 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 0.000000 0.0
14 0.111111 0.25 0.000000 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 ... 0.000000 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 0.055556 0.0
15 0.111111 0.25 0.000000 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 ... 0.000000 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 0.055556 0.0
16 0.000000 0.00 0.038462 0.0 0.0 0.0 0.0 0.0 0.000000 0.038462 ... 0.000000 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 0.000000 0.0
17 0.000000 0.00 0.038462 0.0 0.0 0.0 0.0 0.0 0.000000 0.038462 ... 0.000000 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 0.055556 0.0
18 0.000000 0.00 0.000000 0.0 0.0 0.0 0.0 1.0 0.045455 0.000000 ... 0.000000 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 0.055556 0.0
19 0.000000 0.25 0.038462 0.0 0.0 0.0 0.0 0.0 0.000000 0.038462 ... 0.000000 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 0.055556 0.0
20 0.000000 0.00 0.038462 0.0 0.0 0.0 0.0 0.0 0.000000 0.038462 ... 0.000000 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 0.055556 0.0
21 0.000000 0.00 0.000000 0.0 0.0 0.0 0.0 0.0 0.045455 0.000000 ... 0.000000 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 0.000000 0.0
22 0.000000 0.00 0.000000 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 ... 0.000000 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 0.055556 0.0
23 0.000000 0.00 0.000000 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 ... 0.000000 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 0.055556 0.0
24 0.000000 0.00 0.000000 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 ... 0.000000 0.0 0.5 0.000000 0.0 0.0 0.000000 0.0 0.000000 0.0
25 0.000000 0.25 0.000000 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 ... 0.000000 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 0.000000 0.0
26 0.000000 0.00 0.000000 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 ... 0.142857 0.0 0.0 0.142857 0.0 0.0 0.000000 0.0 0.000000 0.0
27 0.000000 0.00 0.000000 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 ... 0.000000 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 0.000000 0.0
28 0.000000 0.00 0.000000 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 ... 0.142857 0.0 0.0 0.142857 0.0 0.0 0.000000 0.0 0.000000 0.0
29 0.111111 0.00 0.000000 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 ... 0.000000 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 0.000000 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
73 0.000000 0.00 0.038462 0.0 0.0 0.0 0.0 0.0 0.000000 0.038462 ... 0.000000 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 0.055556 0.0
74 0.000000 0.00 0.000000 0.0 0.0 0.0 0.0 0.0 0.045455 0.000000 ... 0.000000 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 0.000000 0.0
75 0.000000 0.00 0.000000 0.0 0.0 0.0 0.0 0.0 0.045455 0.000000 ... 0.000000 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 0.055556 0.0
76 0.000000 0.00 0.000000 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 ... 0.000000 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 0.000000 0.0
77 0.000000 0.00 0.000000 0.0 0.0 0.0 0.0 0.0 0.045455 0.000000 ... 0.000000 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 0.000000 0.0
78 0.000000 0.00 0.000000 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 ... 0.000000 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 0.055556 0.0
79 0.000000 0.00 0.038462 0.0 0.0 0.0 0.0 0.0 0.000000 0.038462 ... 0.000000 0.0 0.0 0.000000 0.0 0.0 0.333333 0.0 0.000000 0.0
80 0.000000 0.00 0.038462 0.0 0.0 0.0 0.0 0.0 0.000000 0.038462 ... 0.000000 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 0.000000 0.0
81 0.000000 0.00 0.000000 0.0 0.0 0.0 0.0 0.0 0.045455 0.000000 ... 0.000000 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 0.000000 0.0
82 0.000000 0.00 0.000000 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 ... 0.000000 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 0.055556 0.0
83 0.000000 0.00 0.000000 0.0 0.5 0.0 0.0 0.0 0.000000 0.000000 ... 0.000000 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 0.000000 0.0
84 0.000000 0.00 0.000000 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 ... 0.000000 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 0.000000 0.0
85 0.000000 0.00 0.000000 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 ... 0.000000 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 0.055556 0.0
86 0.000000 0.00 0.000000 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 ... 0.000000 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 0.000000 0.0
87 0.000000 0.00 0.000000 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 ... 0.000000 0.0 0.0 0.000000 0.0 0.0 0.333333 0.0 0.000000 0.0
88 0.000000 0.00 0.000000 0.0 0.0 0.0 0.0 0.0 0.045455 0.000000 ... 0.000000 0.0 0.0 0.000000 0.0 0.0 0.333333 0.0 0.000000 0.0
89 0.000000 0.00 0.000000 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 ... 0.142857 0.0 0.0 0.142857 0.0 0.0 0.000000 0.0 0.000000 0.0
90 0.000000 0.00 0.000000 0.0 0.0 0.5 0.0 0.0 0.045455 0.000000 ... 0.000000 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 0.000000 0.0
91 0.000000 0.00 0.000000 0.0 0.0 0.0 0.0 0.0 0.045455 0.000000 ... 0.000000 0.0 0.0 0.000000 0.0 0.5 0.000000 0.0 0.000000 0.0
92 0.000000 0.00 0.000000 0.0 0.0 0.0 0.0 0.0 0.045455 0.000000 ... 0.000000 0.0 0.0 0.000000 0.0 0.5 0.000000 0.0 0.000000 0.0
93 0.000000 0.00 0.038462 0.0 0.0 0.0 0.0 0.0 0.000000 0.038462 ... 0.000000 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 0.000000 0.0
94 0.000000 0.00 0.000000 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 ... 0.000000 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 0.055556 0.0
95 0.000000 0.00 0.038462 0.0 0.0 0.0 0.0 0.0 0.000000 0.038462 ... 0.000000 0.0 0.0 0.000000 0.5 0.0 0.000000 0.0 0.000000 0.0
96 0.000000 0.00 0.000000 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 ... 0.000000 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 0.055556 0.0
97 0.000000 0.00 0.038462 0.0 0.0 0.0 0.0 0.0 0.000000 0.038462 ... 0.000000 0.0 0.0 0.000000 0.5 0.0 0.000000 0.5 0.000000 0.0
98 0.000000 0.00 0.000000 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 ... 0.000000 0.0 0.5 0.000000 0.0 0.0 0.000000 0.0 0.000000 0.0
99 0.000000 0.00 0.038462 0.0 0.0 0.0 0.0 0.0 0.000000 0.038462 ... 0.000000 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 0.000000 0.0
100 0.000000 0.00 0.000000 0.0 0.0 0.0 0.0 0.0 0.045455 0.000000 ... 0.000000 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 0.000000 0.0
101 0.000000 0.00 0.000000 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 ... 0.000000 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 0.055556 0.0
102 0.000000 0.00 0.000000 0.0 0.0 0.0 0.0 0.0 0.045455 0.000000 ... 0.000000 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 0.000000 0.0

103 rows × 112 columns

Cool, that seems to work. Now we just need to do it for the complete set of files. Just use beetle/train/Core for the time being.


In [107]:
!ls semeval2013-task7/semeval2013-Task7-5way/beetle/train/Core/


semeval2013-Task7-2and3way url.txt
semeval2013-Task7-5way

Use os.walk to get the files:


In [114]:
import os

We can now do the same as before, but this time using all the files to construct the final dataframe. We also need a series containing the accuracy measures.


In [137]:
tokenDictionaries_ls=[]

# glob would have been easier...
for (root, dirs, files) in os.walk('semeval2013-task7/semeval2013-Task7-5way/beetle/train/Core/'):
    for filename in files:
        if filename.endswith('.xml'):
            tokenDictionaries_ls.extend(extract_token_dictionaries(os.path.join(root, filename)))

# Now we've extracted the information from all the files. We can now construct the dataframe
# in the same way as before:

# Build the lists of responses:
tokenLists_ls=[x['tokens'] for x in tokenDictionaries_ls]

# Build the document frequency dict
docFreq_dict=document_frequencies(tokenLists_ls)

# Now, create a dataframe which is indexed by the tokens
# in the token frequency dictionary:
trainingText_df=pd.DataFrame(index=docFreq_dict.keys())

# Populate the dataframe with the tf.idf for each response. Also,
# create a dictionary of the accuracy values while we're at it.
accuracy_dict={}
for (i, response_dict) in enumerate(tokenDictionaries_ls):
    trainingText_df[i]=pd.Series(get_tfidf(response_dict['tokens'], docFreq_dict), 
                                 index=trainingText_df.index)
    accuracy_dict[i]=response_dict['accuracy']

# Finally, transpose, and replace the NaNs with 0:
trainingText_df=trainingText_df.fillna(0).T

# Also, to make it easier to store in a single csv file, let's put the accuracy
# values in a column (this won't clash with any occurrences of the token "accuracy"
# because we've changed the tokens to "accuracy_SR" and "accuracy_RA"):

trainingText_df['accuracy']=pd.Series(accuracy_dict)
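As the comment in the cell above says, glob would have been easier: pathlib's rglob does the same recursive walk-and-filter in one line. A self-contained sketch on a throwaway directory (the real Core path would be used in exactly the same way):

```python
import tempfile
from pathlib import Path

# Build a throwaway tree with one .xml file and one non-xml file:
with tempfile.TemporaryDirectory() as d:
    sub = Path(d) / 'Core'
    sub.mkdir()
    (sub / 'q1.xml').write_text('<question/>')
    (sub / 'notes.txt').write_text('skip me')
    # rglob('*.xml') recursively finds just the xml files:
    xml_files = sorted(Path(d).rglob('*.xml'))
    print([p.name for p in xml_files])  # ['q1.xml']
```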

In [138]:
trainingText_df.head()


Out[138]:
in_SR germinal_SR theya_SR affected_SR interruption_SR locate_SR see_SR cnnected_SR differnt_SR 3_RA ... components_SR to_RA means_SR s_SR burns_RA electrica_SR seriously_SR lit_SR difference_RA accuracy
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 correct
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 correct
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 contradictory
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 contradictory
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 contradictory

5 rows × 1117 columns

And finish by exporting to a csv file:


In [141]:
trainingText_df.to_csv('beetleTrainingData.csv', index=False)

Done! The data can now be read back into a dataframe with:

pd.read_csv('beetleTrainingData.csv')
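Once it's back in, the accuracy column splits cleanly from the tf.idf features. A minimal round-trip sketch with toy data (the column names here are illustrative):

```python
import io
import pandas as pd

# Stand-in for the exported csv: two feature columns plus accuracy.
csv_text = 'the_SR,gap_RA,accuracy\n0.5,0.0,correct\n0.0,0.25,contradictory\n'

df = pd.read_csv(io.StringIO(csv_text))
y = df['accuracy']               # the labels
X = df.drop(columns='accuracy')  # the tf.idf features
print(list(X.columns))  # ['the_SR', 'gap_RA']
```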